Speedup process_uc_wildcards
#193
Conversation
…tional, e.g. ctslvl and ctype from ~FI_COMM
benchmarks always compare to HEAD of main branch
# Conflicts:
#	.gitignore
#	pyproject.toml
#	utils/run_benchmarks.py
#	xl2times/__main__.py
#	xl2times/datatypes.py
#	xl2times/excel.py
#	xl2times/transforms.py
#	xl2times/utils.py
obscure _match_uc_wildcards bugfix
btw, if you convert it to draft PR the CI will run on every push, as well.
xl2times/utils.py (Outdated)

```python
def remove_positive_patterns(pattern):
    return ",".join([word[1:] for word in pattern.split(",") if word[0] == "-"])

...

@functools.cache
```
Caches old results instead of recalculating the wildcard patterns that are often repeated across tables. Seems to give a small (a couple of percent) speedup.
Nice! Just wondering how large the cache gets for a big model like Ireland/Austimes? Wondering if we'd need to use `@functools.lru_cache(maxsize=...)` at some point instead. Perhaps something to keep in mind if memory usage is ever a problem.
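A minimal sketch of the trade-off being discussed: `@functools.cache` grows without bound, while `@functools.lru_cache(maxsize=...)` caps memory by evicting least-recently-used entries. The function body mirrors the `remove_positive_patterns` helper from the diff; the `maxsize` value here is an arbitrary illustration.

```python
import functools


@functools.lru_cache(maxsize=1024)  # bounded; @functools.cache would be unbounded
def remove_positive_patterns(pattern: str) -> str:
    # Keep only negated words ("-foo"), stripping the leading "-".
    return ",".join(word[1:] for word in pattern.split(",") if word[0] == "-")


print(remove_positive_patterns("a,-b,c,-d"))  # -> b,d
remove_positive_patterns("a,-b,c,-d")         # second call is a cache hit
print(remove_positive_patterns.cache_info())  # hits=1, misses=1, currsize=1
```

`cache_info()` makes it easy to check hit rates and current size when deciding whether a bound is needed.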
organised imports, appeased linters
process_uc_wildcards
Looks great, thanks again!
```python
# Call the conversion function directly
summary = run(parse_args(args))
```
Hmm, thinking about it again, perhaps there is one reason to use subprocess -- at least in the CI, should we check that `xl2times` works as expected from the command line? But on the other hand, `run(parse_args(` is pretty much the same as the CLI invocation, and CI is probably faster without subprocess... I'm undecided, so would love your thoughts. And we can leave it as is in this PR and discuss in an issue, maybe?
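For reference, a minimal sketch of the subprocess alternative being weighed here, demonstrated on the stdlib's `json.tool` rather than `xl2times` itself (the helper name and arguments are illustrative):

```python
import subprocess
import sys


def run_cli(module: str, args: list[str]) -> str:
    # Launch the tool exactly as a user would, in a fresh interpreter.
    # Slower than calling run(parse_args(args)) in-process, but it also
    # exercises the command-line entry point end to end.
    result = subprocess.run(
        [sys.executable, "-m", module, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Demonstrated on a stdlib CLI; "xl2times" would slot in the same way.
print(run_cli("json.tool", ["--help"])[:60])
```

The in-process call skips interpreter start-up on every benchmark, which is why CI is likely faster without subprocess.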
```python
with ProcessPoolExecutor(max_workers=max_workers) as executor:
    results = list(executor.map(run_a_benchmark, benchmarks))

if debug:
    # bypass process pool and call benchmarks directly if --debug is set.
    ...
```
Would love to have this documented in the CLI help for `--debug`, thanks!
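A sketch of how that help text and the debug bypass might fit together, with a toy `run_a_benchmark` stand-in (names here are illustrative, not the actual `run_benchmarks.py` code):

```python
import argparse
from concurrent.futures import ProcessPoolExecutor


def run_a_benchmark(n: int) -> int:
    # Toy stand-in for the real benchmark runner.
    return n * n


def run_all(benchmarks: list[int], debug: bool, max_workers: int = 4) -> list[int]:
    if debug:
        # Serial, in-process: breakpoints and stack traces work normally.
        return [run_a_benchmark(b) for b in benchmarks]
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_a_benchmark, benchmarks))


parser = argparse.ArgumentParser()
parser.add_argument(
    "--debug",
    action="store_true",
    help="run benchmarks serially in this process (bypassing the process "
    "pool) so they can be stepped through in a debugger",
)

print(run_all([1, 2, 3], debug=True))  # -> [1, 4, 9]
```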
```python
matches = [
    df.iloc[:, 0].to_list() if df is not None and len(df) != 0 else None
    for df in matches
]
```
Just curious about this code: is `matches` a list of DataFrames at the end of the previous statement? Perhaps the FIXME can be resolved if we change the `unique_filters.apply(lambda row: ...` line to something like:
`matches = [matcher(row, dictionary) for _index, row in unique_filters.iterrows()]`
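To illustrate the suggestion with a toy `matcher` and toy frames (the real ones live in `transforms.py`): the comprehension yields a plain Python list of DataFrame-or-None, which the downstream list comprehension can then consume directly without any chance of pandas coercing the result.

```python
import pandas as pd


def matcher(row: pd.Series, lookup: pd.DataFrame):
    # Toy stand-in for the real matcher: matching rows, or None if none.
    hits = lookup[lookup["name"] == row["pattern"]]
    return hits if not hits.empty else None


unique_filters = pd.DataFrame({"pattern": ["a", "zzz"]})
dictionary = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Unlike DataFrame.apply, this always stays a plain list of results:
matches = [matcher(row, dictionary) for _index, row in unique_filters.iterrows()]

matches = [
    df.iloc[:, 0].to_list() if df is not None and len(df) != 0 else None
    for df in matches
]
print(matches)  # -> [['a'], None]
```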
reused process/commodities maps
updated --debug doc
# Conflicts:
#	xl2times/transforms.py
…eature/wildcard_speedup
Now I remember why I used subprocess to call `xl2times` from `run_benchmarks.py`: in the CI, when it switches to the main branch, it needs to run the main branch's version of the tool. But if we call the tool as a python function, I think some of the PR version of the tool remains in memory, and we don't get what we want.

This looks to be the reason CI is failing on this PR #183: https://github.com/etsap-TIMES/xl2times/actions/runs/8020431584/job/21910275023 (The error is that it can't find a file in `xl2times/config/...` that was added by the PR when it is running tests in the main branch.)

This PR undoes the change from #193 that removed the use of subprocess. See also https://github.com/etsap-TIMES/xl2times/pull/193/files?diff=unified&w=0#r1498683705
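The module-cache behaviour behind this failure can be sketched in isolation (`fake_xl2times` is a made-up name standing in for the tool's package):

```python
import sys
import types

# Simulate "the PR branch's tool is already imported in this process".
pr_module = types.ModuleType("fake_xl2times")
pr_module.BRANCH = "pr-branch"
sys.modules["fake_xl2times"] = pr_module

# Even after the CI checks out main, an in-process import is served from
# sys.modules, so the PR branch's already-loaded code keeps running:
import fake_xl2times
print(fake_xl2times.BRANCH)  # -> pr-branch

# A fresh subprocess starts with an empty module cache and therefore loads
# whatever branch is currently checked out on disk.
```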
Speeds up `uc_wildcard` processing by about 40x (8s -> 0.18s on the Ireland benchmark) by omitting repeated regex matches, vectorising with `merge`, and omitting some string processing. The difference isn't obvious in the benchmark times because wildcards are a relatively small portion of the total runtime for smaller models, but it should have a significant impact on full model runs.
This is ready for a review now :)